Genome Medicine
○ Springer Science and Business Media LLC
Preprints posted in the last 7 days, ranked by how well they match Genome Medicine's content profile, based on 154 papers previously published here. The average preprint has a 0.32% match score for this journal, so anything above that is already an above-average fit.
Schreiner, P. A.; Markianos, K.; Francis, M.; Despard, B.; Gorman, B. R.; Said, I.; Dong, F.; Gautam, S.; Dochtermann, D.; Shi, Y.; Devineni, P.; Kirkpatrick, C.; Khazanov, N.; Moser, J.; Million Veteran Program, ; Huang, G. D.; Muralidhar, S.; Tsao, P. S.; Pyarajan, S.
The Million Veteran Program (MVP) represents the largest and one of the most diverse single cohorts associated with longitudinal Electronic Health Record (EHR) data. We profiled a subset of samples from MVP using the Illumina Infinium MethylationEPIC Beadchip (EPIC array) to generate one of the largest single-cohort methylation datasets to date. Methylation profiles were analyzed for 45,460 total individuals, with the most populous ancestries composed of 27,455 Europeans, 11,798 African Americans, and 4,859 Admixed Americans. We detail the strict quality control standards implemented to ensure the most robust method of methylation profiling of the MVP cohort. This dataset was then applied to evaluate the effects of smoking exposure on DNA methylation in MVP participants. Ancestry-stratified epigenome-wide association studies (EWAS) of smoking status (ever/never) were performed using over 750,000 probes with certifiable signal. Our multi-ancestry meta-analysis demonstrates replicability with existing EWAS and identifies 3,207 novel probe-smoking associations unlocked via the depth and breadth of data in this cohort.
Hawkey, J.; Nodari, C. S.; Iqbal, Z.; Hunt, M.; Wick, R. R.; Chong, C. E.; Jenkins, C.; Howden, B. P.; Holt, K.; Weill, F.-X.; Baker, K. S.; Ingle, D. J.
Shigella flexneri is the leading causative agent of shigellosis globally. The public health threat posed by S. flexneri is compounded by its emergence as a sexually transmissible infection, the importance of international travel in driving dissemination, and the increasing prevalence of antimicrobial resistance (AMR). A rapid and robust computational method, scalable and readily implementable across jurisdictions, is needed to enhance genomic surveillance and systematically explore features of the population structure of this WHO priority pathogen, particularly as vaccine development efforts are underway. Here, we present Flex-It, a genomic framework and genotyping scheme implemented in Mykrobe for S. flexneri serotypes 1-5, X & Y, compatible with previous approaches used to describe S. flexneri's population structure. To develop Flex-It, we curated a retrospective dataset of 5,819 publicly available S. flexneri genomes. We characterised the global population structure for S. flexneri, exploring geographical and temporal traits, and showed the granular diversity of AMR and serotype profiles. We applied Flex-It to >13,000 genomes routinely generated by public health laboratories from Australia, the UK and the USA across a ten-year period. We found significant genotype diversity in all three locations, with the emergence of genotypes with converged resistance to all major drugs currently used for treatment. Flex-It provides an open-source, novel genotyping method that rapidly characterises S. flexneri and its ciprofloxacin resistance determinants in <1 minute from both short and long whole-genome sequencing reads. Flex-It provides the community with a standardised nomenclature to monitor the emergence and spread of S. flexneri lineages.
Wang, V.; Deng, S.; Aguilar, R.
Background: The retired antigen hypothesis, introduced by Tuohy and colleagues, proposes that tissue-specific proteins expressed conditionally during early life or reproductive stages, then silenced in normal aging tissue, represent safe and effective cancer vaccine targets when re-expressed in tumors. To date, discovery of retired antigens has relied entirely on hypothesis-driven wet lab work, limiting throughput. Methods: Here we present RADAR (Retired Antigen Discovery and Ranking), a multi-omics computational pipeline implemented on a standard server that systematically identifies retired antigen candidates. RADAR comprises four core discovery layers integrating: 1) The Genotype-Tissue Expression Portal (GTEx) normal tissue expression, 2) TCGA tumor re-expression, 3) DNA methylation, and 4) miRNA regulatory networks, each applied sequentially to identify genes exhibiting the epigenetic and post-transcriptional hallmarks of tissue-specific retirement followed by tumor re-activation. Candidate characterization is further supported by three automated modules: 1) protein-level safety screening via the Human Protein Atlas, 2) molecular subtype enrichment analysis, and 3) cross-cancer confirmation, which execute automatically when the relevant data are available for the selected cancer type. Results: The pipeline independently validated known targets including alpha-lactalbumin (LALBA, the basis of the Tuohy Phase 1 triple-negative breast cancer vaccine trial) and anti-Mullerian hormone (AMH), consistent with Tuohy's ovarian cancer vaccine program targeting AMHR2, and rediscovered multiple known cancer-testis antigens (MAGEA1, MAGEC1, SSX1) as positive controls. Among 4,664 initial candidates derived from GTEx, the pipeline identified 20 high-confidence retired antigen candidates passing all filters.
DCAF4L2, COX7B2, TEX19, and CT83 emerge as the highest-priority novel candidates for experimental validation, demonstrating zero expression in critical somatic organs, strong epigenetic silencing, and significant re-expression across multiple cancer types. Conclusion: RADAR provides the first systematic computational framework for retired antigen discovery, offering a reproducible and scalable approach to expanding the cancer immunoprevention pipeline beyond individually characterized targets. The pipeline is fully reproducible, requires no specialized hardware, and is immediately extensible to additional TCGA cancer types.
Pizzagalli, M.; Sasipalli, S.; Leary, O.; Tran, L.; Haas, B.; Tapinos, N.
Background: Transposable elements (TEs) account for over half of the human genome and are often derepressed in cancer. TEs can add cryptic splice sites, undergo exonization, and generate gene-TE fusion transcripts, but the combined effects of TEs on RNA processing and translation in glioblastoma stem cells (GSCs) remain incompletely elucidated. Results: We combined long-read RNA sequencing with polysome profiling in four patient-derived GSCs and two neural stem cell (NSC) controls to resolve TE-associated transcript diversity and its relationship to ribosomal engagement. Across GSCs, we identified 13,421 alternative splicing (AS) events, 3,077 of which contained TEs within 150 bp of splice junctions. AS sites proximal to TEs were associated with increased isoform switching compared to non-TE-associated AS sites (odds ratio 2.9-4.3). Moreover, AS isoforms generated from TE-proximal sites were more likely to exhibit altered ribosomal association (odds ratio 2.54). Directional shifts were observed, with shorter isoforms associating with monosome fractions and longer isoforms with polysome fractions. To enable systematic detection of gene-TE chimeric transcripts, we developed FuTER (Fusion TE Reporter), a long-read-based framework for identifying TE-associated fusions. Application to GSC datasets identified 78 GSC-enriched fusion transcripts, several supported by breakpoint-spanning reads in polysome fractions, consistent with ribosome association. Conclusions: Our data suggest that TEs correlate with abnormal splicing activity and altered ribosome engagement in glioblastoma stem cells. By integrating long-read sequencing with polysome profiling and fusion detection, we establish a framework for analysis of TE-induced transcript diversity and its effects on cancer evolution and plasticity.
Chauquet, S.; Jiang, J.-C.; Barker, L. F.; Hunter, Z. L.; Singh, G.; Wray, N. R.; McRae, A. F.; Shah, S.
Drug targets supported by human genetic evidence have significantly higher approval rates, making genome-wide association studies a valuable resource for drug candidate prioritisation. Transcriptome-wide association study signature-matching is an emerging in silico approach that integrates GWAS data with expression quantitative trait loci to generate a disease gene expression signature, which is then compared against drug perturbation databases such as the Connectivity Map. Despite recent adoption, there is no consensus on optimal methodology. Here, we systematically benchmark key parameters, including TWAS method, eQTL tissue model, similarity metric, gene set size, and CMap cell line, using LDL cholesterol, familial combined hyperlipidemia, and asthma as proof-of-concept traits. We demonstrate that while TWAS signature-matching can successfully prioritise known first-line treatments, performance is highly sensitive to parameter choice; for instance, the selection of the cell line used for drug signatures alone can dramatically alter drug prioritisation. Based on these findings, we propose a best-practice framework for robust, genetically-informed drug prioritisation using TWAS signature-matching.
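The signature-matching step described above can be illustrated with a minimal sketch. It assumes Spearman correlation as the similarity metric (only one of the metrics such a benchmark might compare) and treats a strongly negative correlation as evidence that a drug reverses the disease signature. The gene names, drug names, and scores are hypothetical, and `rank_drugs` is an illustrative helper, not part of any published tool.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_drugs(disease_signature, drug_signatures):
    """Sort drugs so the strongest signature reversal (most negative rho) is first."""
    ranked = []
    for drug, profile in drug_signatures.items():
        genes = sorted(set(disease_signature) & set(profile))
        disease = [disease_signature[g] for g in genes]
        response = [profile[g] for g in genes]
        ranked.append((drug, spearman(disease, response)))
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical TWAS z-scores (disease) and drug-induced expression changes.
disease_signature = {"G1": 3.0, "G2": 1.5, "G3": -0.5, "G4": -2.0}
drug_signatures = {
    "drug_A": {"G1": -2.5, "G2": -1.0, "G3": 0.2, "G4": 1.8},
    "drug_B": {"G1": 2.0, "G2": 1.0, "G3": -0.1, "G4": -1.5},
}
ranked = rank_drugs(disease_signature, drug_signatures)
```

In this toy input, `drug_A` opposes the disease signature gene for gene and is ranked first. Swapping the metric, the gene set, or the drug-signature source changes the ranking, which is the kind of sensitivity the abstract reports.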
Gohar, Y.; Garcia, A. D.; Kichula, K. M.; Norman, P. J.; Dilthey, A. T.
Killer-cell immunoglobulin-like receptor (KIR) genes, key modulators of natural killer (NK) cell activity, play critical roles in immune response and disease susceptibility. Accurate KIR genotyping from short-read sequencing data remains challenging because of high sequence similarity among genes, extensive copy number variation, and substantial allelic diversity. Here, we present KIR*BLOOM, a likelihood-based approach for KIR genotyping from short-read data that models read depth and sequencing error across alternative genotype configurations. KIR*BLOOM first identifies KIR-relevant read pairs, maps them to a KIR allele database, and reduces the candidate allele space by excluding alleles unlikely to be present. It then infers gene copy number and selects alleles under the inferred copy-number constraints. Finally, variant calling is used to refine CDS sequences and identify potential novel alleles. We evaluated performance on 45 whole-genome sequencing samples with haplotype-resolved assemblies from the HPRC or HGSVC, using Immuannot-derived annotations as ground truth. KIR*BLOOM achieved 99.85% precision, 99.92% recall, and a Jaccard index of 99.77% for copy-number inference. At five-digit allele resolution, it achieved 92.73% precision, 92.69% recall, and an 87.29% Jaccard index, outperforming T1K, GraphKIR, and Geny. Together, these results demonstrate that KIR*BLOOM enables highly accurate KIR genotyping from short-read sequencing data.
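For readers unfamiliar with how the three reported metrics relate, the sketch below computes precision, recall, and the Jaccard index from a set of called items and a truth set. It assumes the metrics are derived from pooled true-positive, false-positive, and false-negative counts; the abstract does not state how they were aggregated. Under that assumption J = 1 / (1/P + 1/R - 1). The allele labels are hypothetical and the functions are not part of KIR*BLOOM.

```python
def set_metrics(called, truth):
    """Precision, recall, and Jaccard index for a called set versus a truth set."""
    tp = len(called & truth)   # correctly called
    fp = len(called - truth)   # called but absent from truth
    fn = len(truth - called)   # present in truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, jaccard


def jaccard_from_precision_recall(precision, recall):
    """Jaccard index TP / (TP + FP + FN) rewritten in terms of precision and recall."""
    if precision <= 0 or recall <= 0:
        return 0.0
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)


# Hypothetical allele calls, for illustration only.
called = {"geneA*001", "geneB*002", "geneC*003"}
truth = {"geneA*001", "geneB*002", "geneD*001"}
precision, recall, jaccard = set_metrics(called, truth)

# Copy-number precision and recall as quoted in the abstract.
copy_number_jaccard = jaccard_from_precision_recall(0.9985, 0.9992)
```

With pooled counts, the quoted copy-number precision and recall give a Jaccard index of about 0.9977, in line with the reported 99.77%.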
Vattathil, S. M.; Duong, D. M.; Gearing, M.; Seyfried, N. T.; Wilson, R. S.; Bennett, D. A.; Woltjer, R. L.; Wingo, T. S.; Wingo, A. P.
Behavioral and psychological symptoms of dementia (BPSD) are common, profoundly troubling to patients and caregivers, and difficult to treat, yet their molecular underpinnings remain poorly understood. Here, we generated the first brain proteomic dataset with BPSD phenotyping, profiling the dorsolateral prefrontal cortex of 376 donors from three cohorts spanning nine BPSD domains assessed in life. Protein associations with BPSD were examined using complementary approaches - domain-specific BPSD, multi-domain BPSD, and latent factor modeling - and integrated via cross-cohort meta-analysis. Four proteins (NMT1, DCAKD, DNPH1, and HIBADH) were associated with anxiety in dementia and five proteins (ABL1, SAP18, PLXND1, CTRB2, and LDHD) with multi-domain BPSD or BPSD latent factors after adjusting for sex, age, and other covariates (FDR < 0.05). Additionally, eight protein co-expression networks were associated with BPSD across cohorts. These results link BPSD to dysregulation of synaptic signaling, protein folding, and humoral immune response, providing a molecular framework for therapeutic discovery.
Sionakidis, A.; Pinilla Alba, K.; Abraham, J.; Simidjievski, N.
Emerging multi-omic profiling has made it feasible to subtype disease using multiple molecular layers. However, inconsistent preprocessing, heterogeneous implementations, variable evaluation, and limited reproducibility often constrain method selection. Here, we systematically benchmark 22 publicly available unsupervised approaches for bulk data on the TCGA-BRCA cohort across five modalities (RNA-seq, miRNA, DNA methylation, copy numbers, single nucleotide polymorphisms) and validate findings in two independent datasets, enabling a multi-layered comparison of performance, heterogeneous data support and interpretability. Most approaches fuse multi-omic data to produce a two-cluster solution largely aligned with ER status, with higher-resolution approaches further refining these into four coherent subclasses (angiogenic luminal, oxidative-phosphorylation/HER2-low luminal, immune-inflamed basal-like, and hyper-proliferative basal-like). Our benchmarking results indicate that methods based on similarity networks can efficiently produce stable, reliable partitions. Matrix factorisation and Bayesian factorisation algorithms produce rich latent representations, allowing quantification of feature and modality contributions, albeit at higher computational cost. Consensus clustering can be used on a case-by-case basis to refine partitions into more robust and generalisable findings. We aggregate our insights into a decision workflow that aligns with study goals, data characteristics, and computational resources, enabling optimal analytic strategies. This comprehensive assessment provides a practical roadmap for investigators seeking to extract reproducible, biologically meaningful subtypes from complex multi-omic datasets. We highlight the different technical and practical benefits and trade-offs that shape the selection and development of multi-omic approaches applied in precision oncology.
Mavura, Y.; Crosslin, D.; Ferar, K. D.; Lawlor, J. M.; Greally, J. M.; Hindorff, L.; Jarvik, G. P.; Kalla, S.; Koenig, B. A.; Kvale, M.; Kwok, P.-Y.; Norton, M.; Plon, S. E.; Powell, B. C.; Slavotinek, A.; Thompson, M. L.; Popejoy, A. B.; Kenny, E. E.; Risch, N.
Purpose: Diagnostic yield from exome and genome sequencing varies widely across studies. It remains unclear how much of this variation reflects patient-level factors (e.g., sex, clinical features, race/ethnicity, genetic ancestry) versus site-level practices such as sequencing modality or variant interpretation workflows. We aimed to quantify the contributions of these factors to diagnostic outcomes across five U.S. clinical sequencing sites. Methods: We performed a cross-sectional analysis of 3,008 prenatal, neonatal, and pediatric cases from the NHGRI Clinical Sequencing Evidence-Generating Research (CSER) consortium (2017-2023). Clinical indications spanned neurodevelopmental, neurological, immunological, metabolic, craniofacial, skeletal, cardiac, prenatal, and oncologic presentations. Genetic ancestry was inferred from sequencing data, and variants were interpreted using ACMG/AMP guidelines to classify DNA-based diagnoses. Generalized linear mixed models were used to estimate associations between diagnostic yield and fixed effects (sex, prenatal status, isolated cancer, number of clinical indications, sequencing modality, race/ethnicity, and genetic ancestry), while modeling study site as a random effect to quantify between-site variation. Results: The overall diagnostic yield was 19.0%. Multiple clinical indications (OR=1.47, 95% CI 1.20-1.80, p<0.001) were associated with higher diagnostic yield, and male sex (OR=0.80, 95% CI 0.66-0.96, p=0.017) and prenatal status (OR=0.63, 95% CI 0.44-0.90, p=0.012) were associated with lower yield. Sequencing modality, race/ethnicity, genetic ancestry, and isolated cancer were not statistically significantly associated with diagnostic outcomes. A model without fixed effects attributed ~10% of variance in diagnostic yield to between-site differences. After adjusting for covariates, site-level variance decreased to 5.7%, indicating consistent variation across sites not explained by measured patient factors.
Conclusion: Across five sites, patient-level clinical features influenced diagnostic yield, but substantial site-level variation remained even after adjustment. Differences in variant interpretation or case-classification practices may contribute to this residual variability. Further efforts to increase consistency in exome- and genome-sequencing diagnostic workflows may help reduce inter-site differences.
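One common way to express the share of variance attributable to a random site effect in a logistic mixed model is the latent-scale intraclass correlation, ICC = σ²_site / (σ²_site + π²/3). Whether the authors used this definition is an assumption; the abstract reports only the resulting percentages. The sketch below shows the conversion in both directions. The site variances it produces are back-calculated for illustration and are not reported values.

```python
import math

# Residual variance of the standard logistic distribution (about 3.29).
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def latent_scale_icc(site_variance):
    """Share of latent-scale variance attributable to the site random intercept."""
    return site_variance / (site_variance + LOGISTIC_RESIDUAL_VARIANCE)


def site_variance_for_icc(icc):
    """Random-intercept variance that would yield a given latent-scale ICC."""
    if not 0 <= icc < 1:
        raise ValueError("icc must lie in [0, 1)")
    return icc * LOGISTIC_RESIDUAL_VARIANCE / (1 - icc)


# Variances implied by the quoted shares, under the assumption stated above.
variance_unadjusted = site_variance_for_icc(0.10)
variance_adjusted = site_variance_for_icc(0.057)
```

Under this definition, a drop from roughly 10% to 5.7% corresponds to the site-intercept variance falling by a little under half on the latent scale.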
Nabunje, R.; Guillen-Guio, B.; Hernandez-Beeftink, T.; Joof, E.; Leavy, O. C.; International IPF Genetics Consortium, ; Maher, T. M.; Molyneux, P.; Noth, I.; Urrutia, A.; Aburto, M.; Flores, C.; Jenkins, R. G.; Wain, L. V.; Allen, R. J.
Genome-wide association studies of idiopathic pulmonary fibrosis (IPF) have identified 35 common genetic risk loci associated with IPF susceptibility. In this study, we evaluated the effects of the reported variants in clinically curated non-European individuals. Despite limited sample sizes, we observed partial replication, limited transferability of some variants and evidence of ancestry-specific effects. The MUC5B promoter variant rs35705950 emerged as the dominant and most consistent signal across ancestries. Our findings highlight the need for larger, well-characterised studies in understudied populations to support robust discovery and translation.
Chen, H.; Wang, X.; Sun, Y.; Vanegas, N. D. P.; Rodriguez, J.; Ghobashi, A.; Ma, A.; Mora, A. L.; Rojas, M.; Ma, Q.
Spatial transcriptomics (ST) enables the study of cell-cell communication in native tissue context, but current methods for ligand-receptor interaction (LRI) inference generally rely on static, distance-based assumptions. Here we present SpaFlow, a reaction-diffusion framework that models ligand diffusion, binding, dissociation, production and degradation to infer spatially resolved LRI activity and hotspots from ST data. Across paired 10x Visium and CosMx metastatic renal cell carcinoma datasets, SpaFlow outperformed existing methods in recovering spatially coherent LRIs, with inferred LRI activity showing stronger association with downstream signaling. In hepatocellular carcinoma after neoadjuvant immunotherapy, SpaFlow identified CXCL12-CXCR4 hotspots enriched at immune-rich tumor boundaries in responders. In aging mouse heart, SpaFlow resolved niche-specific pro-fibrotic and senescence-associated signaling, highlighting Postn-Itgav/Itgb5 as an additional pro-fibrotic axis and Angptl2-Pirb as a candidate mediator of inter-niche senescence-related communication. In human idiopathic pulmonary fibrosis lung, SpaFlow localized CXCL12-CXCR4 signaling between adventitial fibroblasts and CD4 T cells, CD8 T cells, and B cells in the surrounding fibrotic regions. Together, SpaFlow provides a physically informed framework for quantifying spatially constrained cell-cell communication and mechanistically interpreting signaling patterns in complex tissues.
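To make the five modelled processes concrete, here is a generic one-dimensional reaction-diffusion sketch with explicit Euler steps. It is not SpaFlow's formulation: the equations, grid, rate constants, and function name are all assumptions chosen for illustration. A single source site produces ligand, which diffuses, degrades, and reversibly binds receptors at fixed sites.

```python
def simulate_ligand_receptor(n_sites, source, receptors, steps=500, dt=0.1,
                             diffusion=1.0, production=1.0, degradation=0.1,
                             k_on=0.5, k_off=0.1):
    """Explicit-Euler simulation on a 1D grid with unit spacing.

    receptors maps site index -> total receptor amount (hypothetical values).
    Returns (free ligand per site, bound ligand-receptor complex per site).
    """
    ligand = [0.0] * n_sites
    bound = [0.0] * n_sites
    for _ in range(steps):
        new_ligand = ligand[:]
        new_bound = bound[:]
        for i in range(n_sites):
            # No-flux boundaries: mirror the edge value.
            left = ligand[i - 1] if i > 0 else ligand[i]
            right = ligand[i + 1] if i < n_sites - 1 else ligand[i]
            laplacian = left - 2.0 * ligand[i] + right
            free_receptor = receptors.get(i, 0.0) - bound[i]
            binding = k_on * ligand[i] * free_receptor
            unbinding = k_off * bound[i]
            made = production if i == source else 0.0
            new_ligand[i] += dt * (diffusion * laplacian + made
                                   - degradation * ligand[i]
                                   - binding + unbinding)
            new_bound[i] += dt * (binding - unbinding)
        ligand, bound = new_ligand, new_bound
    return ligand, bound


# One source at site 5; equal receptor amounts near (site 8) and far (site 16).
ligand, bound = simulate_ligand_receptor(21, source=5,
                                         receptors={8: 1.0, 16: 1.0})
```

In this toy setup the receptor site closer to the source accumulates more bound complex than the distant one, which is the qualitative behaviour that distinguishes a diffusion-based model from a fixed distance cutoff. The default step size keeps `diffusion * dt` well below 0.5, the usual stability bound for this explicit scheme on a unit grid.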
Nicholas, M. T.; Mehta, D.; Ouyang, J.; Dawoud, A.; Ellison, C.; Westendorf, J.; Green, L. A.; Skipp, P.; Rackham, O.
Single-cell RNA sequencing (scRNA-seq) has transformed our ability to analyse cellular heterogeneity, enabling detailed mapping of cellular progression. Trajectory inference tools construct trajectories from scRNA-seq data, facilitating the tracing of cellular progression through developmental pathways. PathPinpointR (PPR) is a lightweight and user-friendly R package developed to predict and compare the positions of scRNA-seq samples along reference biological trajectories, such as those created from large cell atlas projects. PPR utilises sets of switching-gene events from reference trajectories as indicators of cellular progression. By applying these positional indicators to query datasets, each cell can be accurately assigned a pseudo-time value, providing predictive insight into its position along a trajectory. This information can be used to stage cells within an established developmental process, or to evaluate how different patient samples compare when mapped onto reference disease or drug response trajectories. Availability: PathPinpointR is available at https://github.com/moi-taiga/PathPinpointR. Contact: o.j.l.rackham@soton.ac.uk
Litster, T. M.; Wilcox, R. A.; Carroll, R.; Gardner, A. E.; Nazri, N. M.; Shoubridge, C. A.; Delatycki, M. B.; Lohmann, K.; Agzarian, M.; Turella Divani, R.; Rafehi, H.; Scott, L.; Monahan, G.; Lamont, P. J.; Ashton, C.; Laing, N. G.; Ravenscroft, G.; Bahlo, M.; Haan, E.; Lockhart, P. J.; Friend, K. L.; Corbett, M. A.; Gecz, J.
The spinocerebellar ataxias (SCAs) are a clinically heterogeneous group of neurodegenerative disorders that affect movement, vision, speech and balance. Here, we reassign the linkage of SCA30 to 14q32.13 based on a cumulative LOD score >12. Within this interval we identified a 331 kb duplication, absent in population controls and not observed in >800 unrelated individuals with genetically unresolved cerebellar ataxia. RNA-seq analysis of patient-derived lymphoblastoid cell lines revealed a splice-mediated chimeric transcript resulting from the duplication event. This transcript joined exon 1 of CLMN to exon 2 of SYNE3. In silico translation predicted that this chimeric transcript would produce a short N-terminal peptide corresponding to exon 1 of CLMN and the usually untranslated region of exon 2 of SYNE3 fused to the complete and in-frame SYNE3 protein. Transient overexpression of SYNE3 or the CLMN::SYNE3 fusion protein, in both HeLa cells and mouse primary cortical neurons, resulted in equivalent cellular outcomes including altered nuclear morphology and chromosomal DNA fragmentation. SYNE3 forms part of the linker of nucleoskeleton and cytoskeleton complex and is not usually expressed in cerebellar Purkyně neurons, while CLMN has a Purkyně-specific expression pattern within the brain. Our data suggest that ectopic expression of SYNE3 in cerebellar Purkyně neurons, mediated by the CLMN promoter, leads to cerebellar atrophy and causes spinocerebellar ataxia in the SCA30 family. This is an example of Mendelian disease arising from a novel, chimeric transcript with a likely dominant negative effect. Chimeric transcripts are commonly associated with cancers, but they are not often associated with monogenic disorders. Detection of chimeric transcripts as part of structural variant analysis could increase the genetic diagnostic yield of Mendelian disorders.
Miao, Z.; Qu, Y.; Huang, S.; Laux, L.; Peters, S.; Aristel, A.; Zhang, Z.; Niedernhofer, L. J.; McMahon, A.; Kim, J.; Zhang, N.
Spatial transcriptomics enables the study of how cells coordinate their molecular states within tissue, providing insight into both normal function and disease processes. A key challenge is to identify gene expression programs that vary continuously across space and are coordinated between cell types. We present CoPro, a computational framework for detecting the spatially coordinated progression of cellular states. CoPro can operate in both supervised and unsupervised modes to identify gene programs that co-vary within or between cell types, and to disentangle multiple overlapping spatial patterns. CoPro can be applied to single-cell-level spatial transcriptomics datasets, including MERFISH, SeqFISH+, Xenium, and histology-imputed transcriptomic data. We demonstrate the utility of CoPro with data collected from colon, brain, liver, and kidney tissues. In the colon, CoPro separates epithelial differentiation along the crypt axis from spatially localized inflammatory signals. In the aging liver, it identifies multiple aging-associated cellular programs superimposed on anatomical zonation. In the brain, the flexible kernel design enables the decoupling of the gene expression gradient along the dorsal-ventral and medial-lateral axes. In the kidney, CoPro identifies tubule-vasculature coordination that is essential in nephron function. These results demonstrate CoPro's utility for analyzing spatial coordination of gene expression in complex tissues and disentangling overlapping biological processes, such as anatomical organization and disease-associated variation.
Kirby, B.; Di Bernardo, M.; Cheeseman, I. M.; Blainey, P.
Optical pooled screens (OPS) are bottlenecked by labor-intensive in situ sequencing and analysis protocols. Here we present OttoSeq, an automated OPS platform combining the Otto2 fluid handling system with the Brieflow analysis pipeline. We utilized OttoSeq to complete a genome-wide cell painting screen in eight days, sampling more than 5 million high-quality cells across 21,732 gene knockout perturbations (224 cells per gene) and interpreting 320 functional gene clusters.
Ota, K.; Ito, T.; Shimizu, H.
A substantial proportion of cancer patients fail to benefit from their prescribed combination regimens, yet identifying superior alternatives from the vast pharmacological space prior to treatment failure remains an unsolved clinical challenge. Existing computational approaches either rely on multi-omics profiles unavailable in standard oncological practice or reduce drug efficacy to scalar metrics that discard the dose-dependent resolution essential for therapeutic optimization. Here, we present XACT, a hierarchical deep learning framework that reconstructs full dose-dependent drug responses for both monotherapy and drug combinations using only clinically accessible transcriptomic profiles. By leveraging an asymmetric X-Linear Attention mechanism that models second-order interactions between molecular drug substructures and intracellular signaling pathway activities, XACT captures concentration-dependent pharmacodynamics with state-of-the-art accuracy and generalizability to unseen transcriptomic landscapes. When applied to the TCGA pan-cancer cohort, XACT-derived resistance scores were significantly associated with clinical treatment outcomes and stratified overall survival as the strongest independent prognostic factor after multivariate adjustment for tumor stage and cancer type. Systematic virtual screening revealed therapeutic vulnerabilities and nominated alternative regimens for treatment-refractory sarcoma and pancreatic adenocarcinoma. These results establish XACT as a scalable, interpretable, and clinically translatable framework that advances precision oncology from computational prediction toward data-driven therapeutic prescription.
Di Scipio, M.; Man, A.; Lali, R.; Wu, J.; Le, A.; Franks, P. W.; Pare, G.
Genome-guided dietary advice is a goal of precision nutrition. However, the contribution of gene-diet interactions (GxDs) to disease risk remains unclear, hindering the identification of diet-outcome pairs more likely amenable to genetic-based recommendations. We thus implemented a two-step approach: first, we comprehensively assessed the contributions of genome-wide GxDs to cardiometabolic outcomes across a broad array of dietary exposures in UK Biobank participants (N = 141,144 to 325,989). Second, we selected the 20 significant diet-outcome pairs from the 713 pairs tested (p < 7.0 × 10⁻⁵) and derived GxD polygenic scores. In an independent sample, all scores were nominally associated with their corresponding outcomes, with 12 of 20 polygenic scores Bonferroni significant (p < 0.0025). Further analyses revealed GxD polygenic scores were associated with clinical outcomes such as incident gout, suggesting translational potential. Altogether, these results showcase the promise of GxD scores to inform precision nutrition.
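Both significance thresholds quoted above are consistent with a Bonferroni correction at a family-wise alpha of 0.05. The alpha value is an assumption here, since the abstract does not state it. A minimal sketch of the arithmetic:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that controls the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise error rate, not stated in the abstract

# 713 diet-outcome pairs tested in the first step.
discovery_threshold = bonferroni_threshold(ALPHA, 713)

# 20 polygenic scores tested in the independent sample.
replication_threshold = bonferroni_threshold(ALPHA, 20)
```

Here `discovery_threshold` is about 7.0 × 10⁻⁵ and `replication_threshold` is 0.0025, matching the two cut-offs in the abstract.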
Sakaue, S.; Yang, D.; Zhang, H.; Posner, D.; Rodriguez, Z.; Love, Z.; Cui, J.; Budu-Aggrey, A.; Ho, Y.-L.; Costa, L.; Monach, P.; Huang, S.; Ishigaki, K.; Melley, C.; Tanukonda, V.; Sangar, R.; Maripuri, M.; Sweet, S. M.; Panickan, V.; McDermott, G.; Hanberg, J. S.; Riley, T.; Laufer, V.; Okada, Y.; Scott, I.; Bridges, S. L.; Baker, J.; VA Million Veteran Program, ; Wilson, P. W.; Gaziano, J. M.; Hong, C.; Verma, A.; Cho, K.; Huffman, J. E.; Cai, T.; Raychaudhuri, S.; Liao, K. P.
Rheumatoid arthritis (RA) is a heritable and common autoimmune condition. To date, most genetic associations were derived from individuals with either European or East Asian ancestries. Here, we applied a multimodal automated phenotyping strategy to define RA and performed a genome-wide association study (GWAS) of RA in the Million Veteran Program (MVP), including underrepresented African American (AFR) and Admixed American (AMR) populations. Meta-analyses with previous RA cohorts identified 152 autosomal genome-wide significant loci, of which 31 were novel. Inclusion of multi-ancestry data dramatically improved fine-mapping resolution. Functional characterization of these loci using single-cell transcriptomic and chromatin data suggested new RA genes such as CHD7 and CD247. We identified underappreciated functional roles of fine-grained immune cell states other than T cells, such as B cell and myeloid cell states. We observed that multi-ancestry polygenic risk scores using our data demonstrated better predictive ability, especially for AFR and AMR populations.
Razavi, M.; Tellapragada, C.; Giske, C. G.
Cefiderocol uptake in Enterobacterales depends partly on TonB-dependent catecholate transporters, including CirA, yet the functional interpretation of CirA missense variation remains limited by an absence of large experimental phenotype datasets. Here we describe a structure-informed Siamese graph neural network (GNN) framework designed to prioritise CirA missense variants that are likely to impair transporter function and thereby contribute to reduced cefiderocol susceptibility. Because large experimental datasets of CirA missense phenotypes are not available, we trained the model on a synthetic mutant set generated from structurally motivated rules applied to the CirA reference structure (AlphaFold model, UniProt P17315). Each residue was represented using protein language model embeddings, backbone geometry, and amino-acid identity, and paired wild-type and mutant graphs were compared through a shared encoder. On synthetic held-out benchmarks, the model achieved strong classification performance on a position-held-out split (macro-F1 = 0.989 against synthetic labels). Applied to a collection of Escherichia coli CirA protein sequences, the framework prioritised a subset of variants as high-confidence non-benign candidates and assigned many others to review or abstain categories, reflecting predictive uncertainty outside the synthetic training distribution. A post-hoc severity-ranking scheme triages disruptive candidates for follow-up. This framework demonstrates that structure-informed synthetic data generation paired with Siamese GNN inference can bridge the gap between sequence-level genomic surveillance and mechanistic functional prediction of outer-membrane transporter variants.
Carver, S.; Perea-Chamblee, T.; Taraszka, K.; Moon, I.; Yu, X.; Ding, Y.; Carrot-Zhang, J.; Gusev, A.
Genome-wide association studies (GWAS) have advanced the understanding of germline susceptibility in common cancers, yet rare malignancies remain underexplored due to limited sample sizes. To address this gap, we conducted large-scale GWAS across 20 rare cancer types and meta-analyzed results from three cohorts: two clinically sequenced cancer center cohorts and an independent population biobank, comprising over 480,000 individuals. We identified nine novel genome-wide significant susceptibility loci with moderate to large effect sizes that replicated across cohorts in eight rare malignancies, including myelodysplastic syndromes (MDS), germ cell tumors, gastrointestinal stromal tumor (GIST), gastrointestinal neuroendocrine tumors, anal cancer (ANSC), non-melanoma skin cancer, mesothelioma, and hepatobiliary cancer. Among the strongest associations were loci in MDS near API5 (OR = 2.21, p = 1.06 × 10⁻⁸), in GIST near SLC6A18 and TERT (OR = 1.91, p = 8.20 × 10⁻⁵⁰), and in ANSC near HLA-DQA2 (OR = 1.58, p = 5.50 × 10⁻¹⁸). The GIST risk variant was enriched in tumors harboring somatic KIT mutations (OR = 2.21, p = 6.5 × 10⁻⁴) and was associated with worse survival among carriers with KIT-mutant tumors (hazard ratio = 4.06, p = 0.015), implicating germline-somatic interplay in tumor initiation and progression. The ANSC risk variant was associated with HPV infection (OR = 1.44, p = 3.19 × 10⁻⁵), supporting a host-viral interaction in HPV-driven tumorigenesis. The MDS risk variant at the API5 locus was associated with altered neutrophil counts, suggesting a role in hematopoietic dysregulation in disease pathogenesis. We further identified novel, independent associations with mesothelioma, GIST, and hepatobiliary cancer at the 5p15.33 locus encompassing TERT, consistent with pleiotropic genetic effects at a core telomere-maintenance gene.
Collectively, these findings demonstrate that integrating clinically ascertained sequencing cohorts with population biobanks substantially enhances germline discovery in rare cancers, enabling identification of high-confidence susceptibility loci and facilitating downstream biological interpretation through linked somatic, viral, and clinical data. This framework provides a scalable approach for characterizing inherited susceptibility across diverse rare malignancies.